1 Where Do People Drink The Most Beer, Wine And Spirits?

library(fivethirtyeight)
data(drinks)

The drinks data has 1 character variable and 4 numeric variables and there are no missing values we should worry about.

skim(drinks)
Data summary
Name drinks
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁

Below is a plot of the top 25 beer consuming countries.

drinks_beer <- drinks %>% 
  arrange(desc(beer_servings)) %>% 
  head(25)
ggplot(drinks_beer, aes(y = reorder(country, beer_servings), x = beer_servings)) + 
  geom_col() + 
  labs(title = "Global Beer Consumption", 
       y = "",
       x = "Beer Servings",
       caption = "Source: FiveThirtyEight") +
  theme_economist()

Next is a plot that shows the top 25 wine consuming countries.

drinks_wine <- drinks %>% 
  arrange(desc(wine_servings)) %>% 
  head(25)
ggplot(drinks_wine, aes(y = reorder(country, wine_servings), x = wine_servings)) + 
  geom_col() + 
  labs(title = "Global Wine Consumption", 
       y = "",
       x = "Wine Servings",
       caption = "Source: FiveThirtyEight") +
  theme_economist()

Finally, a plot that shows the top 25 spirit consuming countries.

drinks_spirit <- drinks %>% 
  arrange(desc(spirit_servings)) %>% 
  head(25)
ggplot(drinks_spirit, aes(y = reorder(country, spirit_servings), x = spirit_servings)) + 
  geom_col() + 
  labs(title = "Global Spirit Consumption", 
       y = "",
       x = "Spirit Servings",
       caption = "Source: FiveThirtyEight") +
  theme_economist()

Across the board, there appears to be a strong cultural bias to the types of alcohol that countries consume. For example, Germany is fourth on the chart for global annual beer consumption per person and France tops the wine consumption chart. Furthermore, particularly for beer and spirit consumption, countries that are culturally tied through history have similar consumption patterns. Namibia tops the chart for beer consumption, likely due to its status as a former German colony. A similar pattern exists on the spirits chart as well. Many of the countries in the former Soviet Union and Eastern Bloc appear on that chart, likely due to the common consumption of vodka.

Contrary to the beer and spirit categories, however, wine consumption is much less tied to shared cultural history and seems more reliant on shared geography and geographic proximity. Because wine is an alcoholic beverage that requires grapes grown in specific climates, it is not as easy to produce across the world. Therefore, the consumption of wine is heavily concentrated in Europe and thus close to the wine producing regions of France, Italy, and Portugal.

2 Analysis of movies- IMDB dataset

We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset.

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of Facebook likes cast members received
  • votes: the number of people who voted for (or rated) the movie in IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

There are no missing values in the data. While there are duplicate entries for each of the three character variables, these are expected.

movies <- read_csv(here("data", "movies.csv"))
skim(movies)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁

Below is a table with a count of movies by genre, ranked in descending order.

movies_genre <- movies %>% 
  count(genre) %>% 
  arrange(desc(n))
movies_genre
## # A tibble: 17 x 2
##    genre           n
##    <chr>       <int>
##  1 Comedy        848
##  2 Action        738
##  3 Drama         498
##  4 Adventure     288
##  5 Crime         202
##  6 Biography     135
##  7 Horror        131
##  8 Animation      35
##  9 Fantasy        28
## 10 Documentary    25
## 11 Mystery        16
## 12 Sci-Fi          7
## 13 Family          3
## 14 Musical         2
## 15 Romance         2
## 16 Western         2
## 17 Thriller        1

Next is a table with the average gross earning and budget by genre. The variable return_on_budget shows how many $ a movie made at the box office for each $ of its budget. Genres are ranked by this return_on_budget in descending order.

movies_genre_gross <- movies %>% 
  group_by(genre) %>% 
  summarise(avg_gross = mean(gross), avg_budget = mean(budget)) %>% 
  mutate(return_on_budget = avg_gross/avg_budget) %>% 
  arrange(desc(return_on_budget))
movies_genre_gross
## # A tibble: 17 x 4
##    genre        avg_gross avg_budget return_on_budget
##    <chr>            <dbl>      <dbl>            <dbl>
##  1 Musical      92084000    3189500          28.9    
##  2 Family      149160478.  14833333.         10.1    
##  3 Western      20821884    3465000           6.01   
##  4 Documentary  17353973.   5887852.          2.95   
##  5 Horror       37713738.  13504916.          2.79   
##  6 Fantasy      42408841.  17582143.          2.41   
##  7 Comedy       42630552.  24446319.          1.74   
##  8 Mystery      67533021.  39218750           1.72   
##  9 Animation    98433792.  61701429.          1.60   
## 10 Biography    45201805.  28543696.          1.58   
## 11 Adventure    95794257.  66290069.          1.45   
## 12 Drama        37465371.  26242933.          1.43   
## 13 Crime        37502397.  26596169.          1.41   
## 14 Romance      31264848.  25107500           1.25   
## 15 Action       86583860.  71354888.          1.21   
## 16 Sci-Fi       29788371.  27607143.          1.08   
## 17 Thriller         2468     300000           0.00823

Here is a table that shows the top 15 directors who have created the highest gross revenue in the box office, along with the mean, median, and standard deviation of gross revenue per director.

movies_directors <- movies %>% 
  group_by(director) %>% 
  summarise(sum_gross = sum(gross), avg_gross = mean(gross), median_gross = median(gross), sd_gross = STDEV(gross)) %>% 
  arrange(desc(sum_gross)) %>% 
  head(15)
movies_directors
## # A tibble: 15 x 5
##    director           sum_gross  avg_gross median_gross   sd_gross
##    <chr>                  <dbl>      <dbl>        <dbl>      <dbl>
##  1 Steven Spielberg  4014061704 174524422.   164435221  101421051.
##  2 Michael Bay       2231242537 171634041.   138396624  127161579.
##  3 Tim Burton        2071275480 129454718.    76519172  108726924.
##  4 Sam Raimi         2014600898 201460090.   234903076  162126632.
##  5 James Cameron     1909725910 318287652.   175562880. 309171337.
##  6 Christopher Nolan 1813227576 226653447    196667606. 187224133.
##  7 George Lucas      1741418480 348283696    380262555  146193880.
##  8 Robert Zemeckis   1619309108 124562239.   100853835   91300279.
##  9 Clint Eastwood    1378321100  72543216.    46700000   75487408.
## 10 Francis Lawrence  1358501971 271700394.   281666058  135437020.
## 11 Ron Howard        1335988092 111332341    101587923   81933761.
## 12 Gore Verbinski    1329600995 189942999.   123207194  154473822.
## 13 Andrew Adamson    1137446920 284361730    279680930. 120895765.
## 14 Shawn Levy        1129750988 102704635.    85463309   65484773.
## 15 Ridley Scott      1128857598  80632686.    47775715   68812285.

Finally, ratings. This is a table that describes how ratings are distributed by genre, including the mean, min, max, median, SD and a density graph that visually shows how ratings are distributed.

movies_ratings <- movies %>% 
  group_by(genre) %>% 
  summarise(avg_rating = mean(rating), 
            min_rating = min(rating), 
            max_rating = max(rating),
            median_rating = median(rating),
            st_rating = STDEV(rating))
movies_ratings
## # A tibble: 17 x 6
##    genre       avg_rating min_rating max_rating median_rating st_rating
##    <chr>            <dbl>      <dbl>      <dbl>         <dbl>     <dbl>
##  1 Action            6.23        2.1        9            6.3      1.03 
##  2 Adventure         6.51        2.3        8.6          6.6      1.09 
##  3 Animation         6.65        4.5        8            6.9      0.968
##  4 Biography         7.11        4.5        8.9          7.2      0.760
##  5 Comedy            6.11        1.9        8.8          6.2      1.02 
##  6 Crime             6.92        4.8        9.3          6.9      0.849
##  7 Documentary       6.66        1.6        8.5          7.4      1.77 
##  8 Drama             6.73        2.1        8.8          6.8      0.917
##  9 Family            6.5         5.7        7.9          5.9      1.22 
## 10 Fantasy           6.15        4.3        7.9          6.45     0.959
## 11 Horror            5.83        3.6        8.5          5.9      1.01 
## 12 Musical           6.75        6.3        7.2          6.75     0.636
## 13 Mystery           6.86        4.6        8.5          6.9      0.882
## 14 Romance           6.65        6.2        7.1          6.65     0.636
## 15 Sci-Fi            6.66        5          8.2          6.4      1.09 
## 16 Thriller          4.8         4.8        4.8          4.8     NA    
## 17 Western           5.70        4.1        7.3          5.70     2.26
ggplot(movies, aes(x = rating)) +
           geom_density() + 
  labs(title = "Distribution of Movie Ratings", x = "Average Rating", y = "", caption = "Source: IMDB") +
  theme_economist()

Below is a scatter plot examining the relationship between gross revenue and cast_facebook_likes. Based on the scatter plot and associated trend line, there appears to be a weak positive correlation between cast_facebook_likes and gross. We mapped cast_facebook_likes to the X-axis and gross to the Y-axis because one would expect cast Facebook activity to drive gross earnings, not the opposite.

ggplot(movies, aes(y = gross, x = cast_facebook_likes)) + 
  geom_point() + 
  scale_x_log10() + 
  scale_y_log10() + 
  geom_smooth(method = "lm") +
  labs(title = "Cast Facebook Activity and Gross Revenue", x = "Cast Facebook Likes", y = "Gross Revenue ($)", caption = "Source: IMDB") +
  theme_economist()

Next is a scatter plot showing the relationship between gross and budget, indicating that budget appears to be a good predictor of how much a movie will make.

ggplot(movies, aes(x = budget, y = gross)) + 
  geom_point() + 
  scale_x_log10() + 
  scale_y_log10() + 
  geom_smooth(method = "lm") +
  labs(title = "Budget and Gross Revenue", x = "Budget ($)", y = "Gross Revenue ($)", caption = "Source: IMDB") +
  theme_economist()

Finally, here is a series of scatter plots examining the relationship between gross and rating. IMDB ratings appear to be a weak predictor of gross earnings for a few genres, most notably action. A strange aspect of this data set is that the rating variable is limited in its scope, ranging from 0 to 10. There also appears to be a concentration of ratings around the 7 to 9 range. Finally, many of the genres do not have enough films to gather any meaningful insight.

ggplot(movies, aes(x = rating, y = gross)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  scale_y_log10() + 
  facet_wrap("genre") + 
  labs(title = "Ratings and Gross Revenue", subtitle = "By Genre", x = "Average Rating", y = "Gross Revenue ($)", caption = "Source: IMDB") +
  theme_economist()

3 Returns of financial stocks

We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.

We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.

nyse <- read_csv(here::here("data","nyse.csv"))

Below is a table and a bar plot that shows the number of companies per sector, in descending order

nyse_sector <- nyse %>% 
  group_by(sector) %>% 
  summarise(sector_count = count(sector)) %>% 
  arrange(desc(sector_count))
nyse_sector
## # A tibble: 12 x 2
##    sector                sector_count
##    <chr>                        <int>
##  1 Finance                         97
##  2 Consumer Services               79
##  3 Public Utilities                60
##  4 Capital Goods                   45
##  5 Health Care                     45
##  6 Energy                          42
##  7 Technology                      40
##  8 Basic Industries                39
##  9 Consumer Non-Durables           31
## 10 Miscellaneous                   12
## 11 Transportation                  10
## 12 Consumer Durables                8
ggplot(nyse_sector, aes(x = sector_count, y = reorder(sector, sector_count))) +
      geom_col() + 
  labs(title = "S&P 500 Sector Representation", y = "", x = "Companies", caption = "Source: NYSE") +
  theme_economist()

Next, let’s choose the Dow Jones Industrial Average (DJIA) stocks and their ticker symbols and download some data. Besides the thirty stocks that make up the DJIA, we will also add SPY which is an SP500 ETF (Exchange Traded Fund).

djia_url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"
tables <- djia_url %>% 
  read_html() %>% 
  html_nodes(css="table")
djia <- map(tables, . %>% 
               html_table(fill=TRUE)%>% 
               clean_names())
table1 <- djia[[2]] %>%
  mutate(date_added = ymd(date_added),
         ticker = ifelse(str_detect(symbol, "NYSE*"),
                          str_sub(symbol,7,11),
                          symbol))
tickers <- table1 %>% 
  select(ticker) %>% 
  pull() %>%
  c("SPY")
myStocks <- tickers %>% 
  tq_get(get  = "stock.prices",
         from = "2000-01-01",
         to   = "2020-08-31") %>%
  group_by(symbol) 
glimpse(myStocks)
## Rows: 153,121
## Columns: 8
## Groups: symbol [31]
## $ symbol   <chr> "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM…
## $ date     <date> 2000-01-03, 2000-01-04, 2000-01-05, 2000-01-06, 2000-01-07,…
## $ open     <dbl> 48.0, 46.4, 45.6, 47.2, 50.6, 50.2, 50.4, 51.0, 50.7, 50.4, …
## $ high     <dbl> 48.2, 47.4, 48.1, 51.2, 51.9, 51.8, 51.2, 51.8, 50.9, 50.5, …
## $ low      <dbl> 47.0, 45.3, 45.6, 47.2, 50.0, 50.0, 50.2, 50.4, 50.2, 49.5, …
## $ close    <dbl> 47.2, 45.3, 46.6, 50.4, 51.4, 51.1, 50.2, 50.4, 50.4, 49.7, …
## $ volume   <dbl> 2173400, 2713800, 3699400, 5975800, 4101200, 3863800, 235760…
## $ adjusted <dbl> 28.1, 26.9, 27.7, 30.0, 30.5, 30.4, 29.9, 30.0, 30.0, 29.5, …

Financial performance analysis depend on returns; If I buy a stock today for $100 and I sell it tomorrow for $101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.

myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns",
               cols = c(nested.col))  
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns",
               cols = c(nested.col)) 
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns",
               cols = c(nested.col))

Next is a data frame summarizing monthly returns since 2017-01-01 for each of the stocks and SPY, comprised of min, max, median, mean, and SD.

djia_2017 <- myStocks_returns_monthly %>% 
  filter(date >= "2017-01-01") %>% 
  group_by(symbol) %>% 
  summarise(min = min(monthly_returns), max = max(monthly_returns), median = median(monthly_returns), avg = mean(monthly_returns), std_dev = STDEV(monthly_returns))
djia_2017
## # A tibble: 31 x 6
##    symbol    min    max   median       avg std_dev
##    <chr>   <dbl>  <dbl>    <dbl>     <dbl>   <dbl>
##  1 AAPL   -0.181 0.200   0.0513   0.0387    0.0873
##  2 AMGN   -0.104 0.180   0.0235   0.0171    0.0664
##  3 AXP    -0.221 0.0988  0.0150   0.0109    0.0639
##  4 BA     -0.458 0.257   0.0250   0.0124    0.120 
##  5 CAT    -0.199 0.138   0.0318   0.0151    0.0742
##  6 CRM    -0.155 0.391   0.0403   0.0350    0.0850
##  7 CSCO   -0.155 0.130   0.0185   0.0125    0.0673
##  8 CVX    -0.224 0.270   0.00173 -0.000926  0.0748
##  9 DIS    -0.179 0.234  -0.00208  0.00967   0.0750
## 10 DOW    -0.276 0.255   0.0456   0.00898   0.128 
## # … with 21 more rows

Here is a series of density plots broken down by each of the stocks in the DJIA.

plot_returns <- ggplot(myStocks_returns_monthly, aes(x=monthly_returns)) + 
  geom_density() +
  labs(title = "Distribution of Monthly Returns", x = "Monthly Return", y = "", caption = "Source: NYSE") +
  facet_wrap("symbol") + 
  theme_economist()
plot_returns  

Regarding the plots, we can see that returns generally range between +/-25%, being highly concentrated around a 0% return. The riskiest stock in the DJIA appears to be Dow Inc. due to its higher standard deviation, although that may be due to its existence as an independent company commencing only in March 2019. The least risky “stock” is SPY, the S&P 500 index fund. This is no surprise, as it generally represents a very diversified basket of individual companies.

Finally, below is a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) on the X-axis.

plot_expected <- ggplot(djia_2017, aes(x = std_dev,y = avg, label = symbol)) + 
  geom_point() + 
  labs(title = "Risk and Monthly Returns", x = "Risk", y = "Expected monthly return", caption = "Source: NYSE") +
  ggrepel::geom_text_repel() + 
  geom_smooth(method = "lm") +
  theme_economist()
plot_expected

While there is certainly not a strong, positive linear relationship between riskiness and returns, possibly due to the DJIA being a rather small dataset, there are two stocks, in particular, that should have a higher expected return based on their riskiness. The Boeing Company and Dow Inc. should have higher returns for how risky they appear to be in this dataset, but there are rather simple explanations for both. Dow Inc., as discussed above, has only been an independent company since March 2019, thus limiting the number of months of returns that are available and driving up its standard deviation. The Boeing Company has seen a rather high-profile scandal in the past two years result in a dramatic drop in its stock price, dragging down its monthly returns over the period in question.

4 IBM HR Analytics

We will analyze a data set on Human Resource Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.

First let us load the data.

hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, …
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", …
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Trave…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358…
## $ Department               <chr> "Sales", "Research & Development", "Research…
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26,…
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3,…
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3,…
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", …
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, …
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2,…
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1,…
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "La…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3,…
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "M…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 26…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 996…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5,…
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes"…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, …
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3,…
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2,…
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0,…
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, …
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4,…
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3,…
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2,…
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0,…
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3,…

We are going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description.

hr_cleaned <- hr_dataset %>% 
  clean_names() %>% 
  mutate(
    education = case_when(
      education == 1 ~ "Below College",
      education == 2 ~ "College",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Master",
      education == 5 ~ "Doctor"
    ),
    environment_satisfaction = case_when(
      environment_satisfaction == 1 ~ "Low",
      environment_satisfaction == 2 ~ "Medium",
      environment_satisfaction == 3 ~ "High",
      environment_satisfaction == 4 ~ "Very High"
    ),
    job_satisfaction = case_when(
      job_satisfaction == 1 ~ "Low",
      job_satisfaction == 2 ~ "Medium",
      job_satisfaction == 3 ~ "High",
      job_satisfaction == 4 ~ "Very High"
    ),
    performance_rating = case_when(
      performance_rating == 1 ~ "Low",
      performance_rating == 2 ~ "Good",
      performance_rating == 3 ~ "Excellent",
      performance_rating == 4 ~ "Outstanding"
    ),
    work_life_balance = case_when(
      work_life_balance == 1 ~ "Bad",
      work_life_balance == 2 ~ "Good",
      work_life_balance == 3 ~ "Better",
      work_life_balance == 4 ~ "Best"
    )
  ) %>% 
  select(age, attrition, daily_rate, department,
         distance_from_home, education,
         gender, job_role,environment_satisfaction,
         job_satisfaction, marital_status,
         monthly_income, num_companies_worked, percent_salary_hike,
         performance_rating, total_working_years,
         work_life_balance, years_at_company,
         years_since_last_promotion)

First, we will calculate the attrition rate for employees.

hr_attrition <- hr_cleaned %>% 
  group_by(attrition) %>% 
  summarise(leave = count(attrition))
hr_attrition
## # A tibble: 2 x 2
##   attrition leave
##   <chr>     <int>
## 1 No         1233
## 2 Yes         237
attrition_rate <- (237 / (237 + 1233)) * 100
attrition_rate
## [1] 16.1

Assuming the data set is describing one year of HR data, the attrition rate for the year is 16.1%, calculated by taking the share of employees that left the company over the total number of employees.

Next, we will generate abar chart of the mean income by education level.

hr_income <- hr_cleaned %>% 
  group_by(education) %>% 
  summarise(avg_income = mean(monthly_income)) %>% 
  arrange(desc(avg_income)) %>% 
  ggplot(aes(x = avg_income, y = reorder(education, avg_income))) +
  geom_col() +
  labs(title = "Average Monthly Income by Educaiton", x = "Average Monthly Income", y = "", caption = "Source: IBM") +
  theme_economist()
hr_income

Finally, we will create a violin plot overlaid by a box plot, indicating both the distribution of incomes at each education level in the data set and the median and inter-quartile ranges for each education level, ordered by mean income.

hr_violin <- hr_cleaned %>% 
  ggplot(aes(x = reorder(education, monthly_income), y = (monthly_income))) +
   geom_violin() +
   geom_boxplot(width = 0.1) +
   labs(title = "Monthly Income and Education", subtitle = "Distribution and Summary Statistics", y = "Monthly Income ($)", x = "Education") +
    theme_economist()
hr_violin

5 Challenge 1: Replicating a chart

6 Challenge 2: 2016 California Contributors plots

CA_contributors_2016 <- vroom::vroom(here::here("data", "CA_contributors_2016.csv")) %>% 
  mutate(zip = as.character(zip))
zip_full <- vroom::vroom(here::here("zip_code_database.csv"))
zip_clean <- zip_full %>% 
  select(zip, primary_city, state)
contributors_zip <- left_join(CA_contributors_2016, zip_clean)

CA_clinton <- contributors_zip %>% 
  filter(cand_nm == "Clinton, Hillary Rodham") %>% 
  group_by(primary_city) %>% 
  summarise(total = sum(contb_receipt_amt)) %>% 
  arrange(desc(total)) %>% 
  head(10) %>% 
  ggplot(aes(x = total, y = reorder(primary_city, total))) + 
  geom_col(fill = "darkblue") +
  labs(title = "Clinton, Hillary Rodham", x = "Amount raised ($)", y = "") + 
  theme_bw() +
  theme()

CA_trump <- contributors_zip %>% 
  filter(cand_nm == "Trump, Donald J.") %>% 
  group_by(primary_city) %>% 
  summarise(total = sum(contb_receipt_amt)) %>% 
  arrange(desc(total)) %>% 
  head(10) %>% 
  ggplot(aes(x = total, y = reorder(primary_city, total))) + 
  geom_col(fill = "darkred") + 
  labs(title = "Trump, Donald J.", x = "Amount raised ($)", y = "") +
  theme_bw() +
  theme()

CA_clinton + CA_trump

CA_10 <- contributors_zip %>%
  group_by(cand_nm) %>%
  summarise(total_contb = sum(contb_receipt_amt)) %>%
  arrange(desc(total_contb))  %>%
  head(10)
CA_plot <- contributors_zip %>%
    filter(cand_nm %in% CA_10$cand_nm) %>%
    group_by(primary_city, cand_nm) %>%
    summarise(total_raised = sum(contb_receipt_amt)) %>% 
    group_by(cand_nm) %>%
    top_n(10) %>% 
    ungroup %>% 
    mutate(cand_nm = as.factor(cand_nm),
           primary_city = reorder_within(primary_city, total_raised, cand_nm)) 
ggplot(CA_plot, aes(primary_city, total_raised)) +
    geom_col() +
    facet_wrap(~cand_nm, scales = "free") +
    coord_flip() +
    scale_x_reordered() +
    scale_y_continuous() +
    theme_economist() +
    labs(title = "Amount Raised by City", subtitle = "By Candidate", x = "", y = "Amount Raised ($)", caption = "Source: FEC")

7 Details

Pablo Carrera Lorenzo, Celine Chi, Alex Kirk, Hans-Christian Preyer, Luca Toraldo, and Yurui Xu collaborated on this project.

We spent about 15 hours on this problem set.

The faceted California contributions plot gave us the most trouble.

8 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labeled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.